On a Supposed Conceptual Inadequacy of the 
Shannon Information in Quantum Mechanics 

C.G. Timpson* 
The Queen's College, Oxford, OX1 4AW, UK

18 December 2002 



Abstract 

Recently, Brukner and Zeilinger (2001) have claimed that the Shannon 
information is not well defined as a measure of information in quantum 
mechanics, adducing arguments that seek to show that it is inextrica- 
bly tied to classical notions of measurement. It is shown here that these 
arguments do not succeed: the Shannon information does not have problematic
ties to classical concepts. In a further argument, Brukner and
Zeilinger compare the Shannon information unfavourably to their pre- 
ferred information measure, I(p), with regard to the definition of a notion 
of 'total information content'. This argument is found unconvincing and 
the relationship between individual measures of information and notions 
of 'total information content' investigated. We close by considering the 
prospects of Zeilinger's Foundational Principle as a foundational principle 
for quantum mechanics.



1 Introduction 

What role the concept of information might have to play in the foundations 
of quantum mechanics is a question that has recently excited renewed interest 
(Fuchs 2000, 2002; Mermin 2001; Wheeler 1990). Zeilinger, for example, has put 
forward an information-theoretic principle which he suggests might serve as a 
foundational principle for quantum mechanics (Zeilinger 1999), (see Appendix). 
As a part of this project, Brukner and Zeilinger (2001) have criticised Shannon's 
(1948) measure of information, the quantity fundamental to the discussion of 
information in both classical and quantum information theory. They claim that 
the Shannon information is not appropriate as a measure of information in the 
quantum context and have proposed in its stead their own preferred quantity 
and a notion of 'total information content' associated with it, which latter is 
supposed to supplant the von Neumann entropy (Brukner and Zeilinger 1999, 
2000a, 2000b). 

The main aim in Brukner and Zeilinger (2001) is to establish that the Shannon
information is intimately tied to classical notions, in particular to the
preconceptions of classical measurement, and that in consequence it cannot serve
as a measure of information in the quantum context. They seek to establish
this in two ways: first, by arguing that the Shannon measure only makes sense
when we can take there to be a pre-existing sequence of bit values in a message
we are decoding, which is not the case in general for measurements on quantum
systems (consider measurements on qubits in a basis different from their
eigenbasis); and second, by suggesting that Shannon's famous third postulate, the
postulate that secures the uniqueness of the form of the Shannon information
measure (Shannon 1948) and has been seen by many as a necessary axiom for
a measure of information, is motivated by classical preconceptions and does not
apply in general in quantum mechanics, where we must consider non-commuting
observables.

These two arguments do not succeed in showing that the Shannon information
is 'intimately tied to the notion of systems carrying properties prior to and
independent of observation' (Brukner and Zeilinger 2000b:1), however. The first
is based on too narrow a conception of the meaning of the Shannon information
and the second, primarily, on a misreading of what is known as the 'grouping
axiom'. We shall see that the Shannon information is perfectly well defined and
appropriate as a measure of information in the quantum context as well as in
the classical. We will begin by reviewing some of the different ways in which
the Shannon information may be understood (Section 2), before examining this
pair of arguments and seeing where they go wrong (Section 3).

Brukner and Zeilinger have a further argument against the Shannon information
(Section 4). They suggest it is inadequate because it cannot be used to
define an acceptable notion of 'total information content'. Equally, they insist,
the von Neumann entropy cannot be a measure of information content for a
quantum system because it has no general relation to information gain from
the measurements that we might perform on a system, save in the case of
measurement in the basis in which the density matrix is diagonal. By contrast, for
a particular set of measurements, their preferred information measure sums to
a unitarily invariant quantity that they interpret as 'information content', this
being one of their primary reasons for adopting this specific measure. This
property will be seen to have a simple geometric explanation in the Hilbert-Schmidt
representation of density operators, however, rather than being of any great
information-theoretic significance; and this final argument is found unpersuasive, as
the proposed constraint on any information measure regarding the definition
of 'total information content' seems unreasonable. Part of the problem is that
information content, total or otherwise, is not a univocal concept and we need
to be careful to specify precisely what we might mean by it in any given context.






2 Interpretation of the Shannon information 



The technical concept of information relevant to our discussion, the Shannon 
information, finds its home in the context of communication theory. We are 
concerned with a notion of quantity of information; and the notion of quantity 
of information is cashed out in terms of the resources required to transmit
messages (which is, note, a very limited sense of quantity). We shall highlight 
two main ways in which the Shannon information may be understood, the first 
of which rests explicitly on Shannon's 1948 noiseless coding theorem. 

2.1 The communication channel 

It is instructive to begin with a quotation of Shannon's: 

The fundamental problem of communication is that of reproducing
at one point either exactly or approximately a message selected at
another point. Frequently these messages have meaning... These
semantic aspects of communication are irrelevant to the engineering
problem. (Shannon 1948: 31)

The communication system consists of an information source, a transmitter or
encoder, a (possibly noisy) channel, and a receiver (decoder). It must be able 
to deal with any possible message produced (a string of symbols selected in the 
source, or some varying waveform), hence it is quite irrelevant whether what is 
actually transmitted has any meaning or not. 

It is crucial to realise that 'information' in Shannon's theory is not associated 
with individual messages, but rather characterises the source of the messages. 
The point of characterising the source is to discover what capacity is required 
in a communications channel to transmit all the messages the source produces;
and it is for this that the concept of the Shannon information is introduced.
The idea is that the statistical nature of a source can be used to reduce the
capacity of channel required to transmit the messages it produces (we shall
restrict ourselves to the case of discrete messages for simplicity). 

Consider an ensemble X of letters $\{x_1, x_2, \ldots, x_n\}$ occurring with
probabilities $p_i$. This ensemble is our source, from which messages of N
letters are drawn. We are concerned with messages of very large N. For such
messages, we know that typical sequences of letters will contain $Np_i$ of
letter $x_i$, $Np_j$ of $x_j$ and so on. The number of distinct typical
sequences of letters is then given by

$$\frac{N!}{(Np_1)!\,(Np_2)!\cdots(Np_n)!}$$

and using Stirling's approximation, this becomes $2^{NH(X)}$, where

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i \qquad (1)$$

is the Shannon information (logarithms are to base 2 to fix the units of
information as binary bits).

Now as $N \to \infty$, the probability of an atypical sequence appearing becomes
negligible and we are left with only $2^{NH(X)}$ equiprobable typical sequences which
need ever be considered as possible messages. We can thus replace each typical
sequence with a binary code number of NH(X) bits and send that to the receiver
rather than the original message of N letters ($N \log n$ bits).

The message has been compressed from N letters to NH(X) bits ($\leq N \log n$
bits). Shannon's noiseless coding theorem, of which this is a rough sketch, states 
that this represents the optimal compression (Shannon 1948). The Shannon 
information is, then, appropriately called a measure of information because it 
represents the maximum amount that messages consisting of letters drawn from 
an ensemble X can be compressed. 
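
The counting step in this sketch can be checked numerically. The following is a minimal Python illustration and not part of the original argument: the function names and the three-letter source are assumptions chosen for the example. It compares the exact base-2 logarithm of the multinomial count of typical sequences with the Stirling estimate NH(X), per letter.

```python
import math

def shannon_entropy(probs):
    """Shannon information H = -sum_i p_i log2 p_i in bits (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def log2_typical_count(counts):
    """log2 of N! / (c_1! c_2! ... c_n!), the number of sequences with the given letter counts."""
    n_total = sum(counts)
    natural_log = math.lgamma(n_total + 1) - sum(math.lgamma(c + 1) for c in counts)
    return natural_log / math.log(2)

# Hypothetical source, assumed only for this illustration.
probs = [0.5, 0.25, 0.25]
for n_letters in (100, 10_000, 1_000_000):
    counts = [round(p * n_letters) for p in probs]   # N p_i occurrences of letter x_i
    exact = log2_typical_count(counts) / n_letters   # bits per letter, exact count
    estimate = shannon_entropy(probs)                # bits per letter, Stirling estimate
    print(n_letters, exact, estimate)
```

The exact count per letter approaches H(X) from below as N grows, which is the sense in which NH(X) bits suffice for long messages.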

One may also make the derivative statement that the information per letter 
in a message is H(X) bits, which is equal to the information of the source. 
But 'derivative' is an important qualification: we can only consider a letter $x_i$
drawn from an ensemble X to have associated with it the information H(X) if 
we consider it to be a member of a typical sequence of N letters, where N is 
large, drawn from the source. 

Note also that we must strenuously resist any temptation to conclude that,
because the Shannon information tells us the maximum amount a message
drawn from an ensemble can be compressed, it therefore tells us the
irreducible meaning content of the message, specified in bits, which somehow
possess their own intrinsic meaning. This idea rests on a failure to distinguish
between a code, which has no concern with meaning, and a language, which
does (cf. Timpson (2000), Chpt. 5).

2.2 Information and Uncertainty 

Another way of thinking about the Shannon information is as a measure of the 
amount of information that we expect to gain on performing a probabilistic ex- 
periment. The Shannon measure is a measure of the uncertainty of a probability 
distribution as well as serving as a measure of information. A measure of un- 
certainty is a quantitative measure of the lack of concentration of a probability 
distribution; this is called an uncertainty because it measures our uncertainty 
about what the outcome of an experiment completely described by the probability
distribution in question will be. Uffink (1990) provides an axiomatic
characterisation of measures of uncertainty, deriving a general class of measures 
of which the Shannon information is one (see also Maassen and Uffink 1989). 

Imagine a truly random probabilistic experiment described by a probability
distribution $p = \{p_1, \ldots, p_n\}$. The intuitive link between uncertainty and
information is that the greater the uncertainty of this distribution, the more 
we stand to gain from learning the outcome of the experiment. In the case of 
the Shannon information, this notion of how much we gain can be made more 
precise. 

Some care is required when we ask 'how much do we know about the outcome?'
for a probabilistic experiment. In a certain sense, the shape of the
probability distribution might provide no information about what an individual 
outcome will actually be, as any of the outcomes assigned non-zero probability 
can occur. However, we can use the probability distribution to put a value on 
any given outcome. If it is a likely one, then it will be no surprise if it occurs 
so of little value; if an unlikely one, it is a surprise, hence of higher value. A 
nice measure for the value of the occurrence of outcome i is $-\log p_i$, a decreasing
function of the probability of the outcome. We may call this the 'surprise'
information associated with outcome i; it measures the value of what we learn 
from the experiment given that we know the probability distribution for the 
outcomes. 

If the information that we would gain if outcome i were to occur is $-\log p_i$,
then before the experiment, the amount of information we expect to gain is given
by the expectation value of the 'surprise' information, $\sum_i p_i(-\log p_i)$; and this,
of course, is just the Shannon information H of the probability distribution p.
Hence the Shannon information tells us our expected information gain.
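
The relation between the 'surprise' information of individual outcomes and their expectation can be made concrete with a short Python sketch. The function names and the four-outcome distribution are assumptions introduced for illustration only.

```python
import math

def surprisal(p):
    """'Surprise' information -log2 p, in bits, of an outcome that has probability p."""
    return -math.log2(p)

def expected_information_gain(probs):
    """Expectation value sum_i p_i (-log2 p_i); outcomes with p_i = 0 never occur."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# Hypothetical distribution, assumed only for this illustration.
probs = [0.5, 0.25, 0.125, 0.125]
for i, p in enumerate(probs, start=1):
    # Likely outcomes carry little surprise, unlikely ones more.
    print(f"outcome {i}: p = {p}, surprise = {surprisal(p)} bits")
print("expected information gain:", expected_information_gain(probs), "bits")
```

The individual values differ from outcome to outcome; only their probability-weighted mean is the Shannon information of the distribution.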

More generally, a crude sketch of how the relationship between uncertainty 
and expected information gain might be cashed out for the whole class of mea- 
sures of uncertainty may be given as follows. What we know given the proba- 
bility distribution for an experiment is that if the experiment is repeated very 
many times then the sequence of outcomes we attain will be one of the typical 
sequences. How much we learn from actually performing the experiments and 
acquiring one of those sequences, then, will depend on the number of typical se- 
quences; the more there are, the more we stand to gain. Thus for a quantitative 
measure of how much information we gain from the sequence of experiments we 
could just count the number of typical sequences (which would give us NH, the 
Shannon information of the sequence), or we could choose any suitably behaved 
function (e.g. continuous, invariant under relabelling of the outcome probabili- 
ties) that increases as the number of typical sequences increases. This property 
will follow from Schur concavity, which is the key requirement on Uffink's general
class of uncertainty measures $U_r(p)$ (for details of the property of Schur
concavity, see Uffink (1990), Nielsen (2001) and below). $NU_r(p)$,
then, can be understood as a measure of the amount of information we gain 
from a long series N of experiments; we get the average or expected informa- 
tion per measurement by dividing by N, but note that this quantity only makes 
sense if we consider the individual measurement as part of a long sequence of 
measurements.
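
As a rough illustration of what Schur concavity demands, consider the Shannon information itself. The sketch below is not Uffink's construction; it assumes one standard way of making a distribution less concentrated, namely moving two of its probabilities towards their common mean (often called a T-transform), and checks that the Shannon information does not decrease. All names and numbers are illustrative assumptions.

```python
import math

def shannon_entropy(probs):
    """Shannon information in bits (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def t_transform(probs, i, j, t):
    """Move probabilities i and j towards their mean; t in [0, 1], t = 0 changes nothing."""
    spread = list(probs)
    spread[i] = (1 - t) * probs[i] + t * probs[j]
    spread[j] = (1 - t) * probs[j] + t * probs[i]
    return spread

# Hypothetical, fairly concentrated distribution (assumed for this illustration).
p = [0.7, 0.2, 0.1]
for t in (0.0, 0.25, 0.5):
    q = t_transform(p, 0, 2, t)   # q is majorized by p, i.e. no more concentrated than p
    print(t, q, shannon_entropy(q))
```

Any other member of a Schur concave class of measures would order these three distributions in the same way, although the numerical values would differ.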

A precisely similar story can be told for a measure of 'how much we know' 
given a probability distribution. This will be the inverse of an uncertainty: we 
want a measure of the concentration of a probability distribution; the more 
concentrated, the more we know about what the outcome will be (which just 
means, the better we can predict the outcome). A function that decreases 
as the number of typical sequences increases (Schur convexity) will give our 
quantitative measure of how much we know about what the outcome of a long 
run of experiments will be: the more typical sequences there are the less we 
know. We are again talking of a long run of experiments, so 'how much we
know' for a single experiment will only make sense as an average value, when
the single experiment is considered as a member of a long sequence. So note
again that to say we have a certain amount of information (knowledge) about
what the outcome of an experiment will be is not to claim that we have partial
knowledge of some predetermined fact about the outcome of an experiment.

2.3 The minimum number of questions needed to specify a sequence

The final common interpretation of the Shannon information is as the minimum 
average number of binary questions needed to specify a sequence drawn from
an ensemble (Uffink 1990; Ash 1965), although this appears not to provide an 
interpretation of the Shannon information actually independent of the previous 
two. 

Imagine that a long sequence of N letters is drawn from the ensemble X, or
that N independent experiments whose possible outcomes have probabilities $p_i$
are performed, but the list of outcomes is kept from us. Our task is to determine
what the sequence is by asking questions to which the guardian of the sequence
can only answer 'yes' or 'no'; and we choose to do so in such a manner as to
minimize the average number of questions needed. We need to be concerned
with the average number to rule out lucky guesses identifying the sequence.

If we are trying to minimize the average number of questions, it is evident
that the best questioning strategy will be one that attempts to rule out half
the possibilities with each question, for then whatever the answer turns out to
be, we still get the maximum value from each question. Given the probability
distribution, we may attempt to implement this strategy by dividing the possible
outcomes of each individual experiment into classes of equal probability, and
then asking whether or not the outcome lies in one of these classes. We then
try to repeat this process, dividing the remaining set of possible outcomes into
two sets of equal probabilities, and so on. It is in general not possible to proceed
in this manner, dividing a finite set of possible outcomes into two sets of equal
probabilities, and it can be shown that in consequence the average number
of questions required if we ask about each individual experiment in isolation
is greater than or equal to H(X). However, if we consider the N repeated
experiments, where N tends to infinity, and consider asking joint questions
about what the outcomes of the independent experiments were, we can always
divide the classes of possibilities of (joint) outcomes in the required way. Now we
already know that for large N, there are $2^{NH(X)}$ typical sequences, so given that
we can strike out half the possible sequences with each question, the minimum
average number of questions needed to identify the sequence is NH(X). (These
last results are again essentially the noiseless coding theorem.)
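
The gap between questioning each experiment in isolation and asking joint questions can be illustrated numerically. The sketch below assumes that an optimal yes/no questioning strategy corresponds to a Huffman code, which is a standard result but not something stated in the text; the function names and the example distribution are likewise assumptions.

```python
import heapq
import itertools
import math

def shannon_entropy(probs):
    """Shannon information in bits (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_average_questions(probs):
    """Average number of yes/no questions under a Huffman questioning strategy."""
    heap = list(probs)
    heapq.heapify(heap)
    average = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        average += a + b              # each merge costs one question for the outcomes beneath it
        heapq.heappush(heap, a + b)
    return average

def questions_per_experiment(probs, block_size):
    """Ask joint questions about blocks of `block_size` independent experiments."""
    joint = [math.prod(block) for block in itertools.product(probs, repeat=block_size)]
    return huffman_average_questions(joint) / block_size

# Hypothetical distribution that cannot be split into equiprobable halves.
probs = [0.7, 0.2, 0.1]
print("H(X) =", shannon_entropy(probs))
for block_size in (1, 2, 4, 6):
    print(block_size, questions_per_experiment(probs, block_size))
```

For every block size the average per experiment stays at or above H(X) and within 1/N of it, so it converges on H(X) as the blocks grow.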

It is not immediately obvious, however, why the minimum average number
of questions needed to specify a sequence should be related to the notion of
information. (Again, the tendency to think of bits and binary questions as
irreducible meaning elements is to be resisted.) It seems, in fact, that this is
either just another way of talking about the maximum amount that messages
drawn from a given ensemble can be compressed, in which case we are back to
the interpretation of the Shannon information in terms of the noiseless coding
theorem, or it is providing a particular way of characterising how much we stand
to gain from learning a typical sequence, and we return to an interpretation in
terms of our expected information gain.

3 Two arguments against the Shannon information

With this preamble behind us, we may turn to the first of the arguments against 
the Shannon information. 

3.1 Are pre-existing bit-values required?

Since the quantity $-\sum_{i=1}^{n} p_i \log p_i$ is meaningful for any (discrete) probability
distribution (and can be generalised for continuous distributions), Brukner and
Zeilinger's argument must be that when we have probabilities arising from
measurements on quantum systems, $-\sum_{i=1}^{n} p_i \log p_i$ does not correspond to a concept
of information. Their argument concerns measurements on systems that are
all prepared in a given state $|\psi\rangle$, where $|\psi\rangle$ may not be an eigenstate of the
observable we are measuring. The probability distribution $p = \{p_1, \ldots, p_n\}$ for
measurement outcomes will be given by $p_i = \mathrm{Tr}(|\psi\rangle\langle\psi| P_i)$, where the $P_i$ are the
operators corresponding to different measurement outcomes (projection operators
in the spectral decomposition of the observable, for projective measurements).
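
To make the trace rule concrete, here is a minimal Python sketch for a single qubit. It assumes rank-one projectors $P_i = |e_i\rangle\langle e_i|$, for which $\mathrm{Tr}(|\psi\rangle\langle\psi| P_i) = |\langle e_i|\psi\rangle|^2$; the state, the bases and the function names are illustrative assumptions, not taken from Brukner and Zeilinger.

```python
import math

def shannon_entropy(probs):
    """Shannon information in bits (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def born_probabilities(state, basis):
    """p_i = |<e_i|psi>|^2 for each basis vector e_i (rank-one projective measurement)."""
    return [
        abs(sum(complex(e).conjugate() * s for e, s in zip(vector, state))) ** 2
        for vector in basis
    ]

# Hypothetical qubit state cos(theta/2)|0> + sin(theta/2)|1> with theta = pi/3.
theta = math.pi / 3
psi = [math.cos(theta / 2), math.sin(theta / 2)]

computational_basis = [[1, 0], [0, 1]]                             # psi is not an eigenstate here
aligned_basis = [psi, [-math.sin(theta / 2), math.cos(theta / 2)]]  # psi is an eigenstate here

for name, basis in (("computational", computational_basis), ("aligned", aligned_basis)):
    probs = born_probabilities(psi, basis)
    print(name, probs, shannon_entropy(probs))
```

Once the probabilities are in hand, the Shannon quantity is computed exactly as for any classical distribution; nothing in the calculation refers to how the probabilities arose.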

Brukner and Zeilinger suggest that the Shannon information has no meaning 
in the quantum case, because the concept lacks an 'operational definition' in 
terms of the number of binary questions needed to specify an actual concrete
sequence of outcomes. In general in a sequence of measurements on quantum
systems, we cannot consider there to be a pre-existing sequence of possessed
values, at least if we accept the orthodox eigenvalue-eigenstate link for the
ascription of definite values (see e.g. Bub (1997)),[1] and this rules out, they
insist, interpreting the Shannon measure as an amount of information: 

The nonexistence of well-defined bit values prior to and independent
of observation suggests that the Shannon measure, as defined
by the number of binary questions needed to determine the particular
observed sequence 0's and 1's, becomes problematic and even
untenable in defining our uncertainty as given before the measurements
are performed. (Brukner and Zeilinger 2001:1)

...No definite outcomes exist before measurements are performed and
therefore the number of different possible sequences of outcomes does
not characterize our uncertainty about the individual system before
measurements are performed. (Brukner and Zeilinger 2001:3)

[1] In a footnote, Brukner and Zeilinger suggest that the Kochen-Specker theorem in
particular raises problems for the operational definition of the Shannon information. It is not clear,
however, why the impossibility of assigning context independent yes/no answers to questions
asked of the system should be a problem if we are considering an operational definition.
Presumably such a definition would include a concrete specification of the experimental situation,
i.e. refer to the context, and then we are not concerned with assigning a value to an operator
but to the outcome of a specified experimental procedure, and this can be done perfectly
consistently, if we so wish. The de Broglie-Bohm theory, of course, provides a concrete example
(Bell 1982).

These two statements should immediately worry us, however. Recall the key
points of the interpretation of the Shannon information: given a long message
(a long run of experiments), we know that it will be one of the typical sequences
that is instantiated. Given p, we can say what the typical sequences will be,
how many there are, and hence the number of bits (NH(X)) needed to specify
them, independent of whether or not there is a pre-existing sequence of bit
values. It is irrelevant whether there already is some concrete sequence of bits
or not; all possible sequences that will be produced will require the same number
of bits to specify them, as any sequence produced will always be one of the
typical sequences. It clearly makes no difference to this whether the probability
distribution is given classically or comes from the trace rule. Also, the number
of different possible sequences does indeed tell us about our uncertainty before
measurement: what we know is that one of the typical sequences will be
instantiated, what we are ignorant of is which one it will be, and we can put a measure
on how ignorant we are simply by counting the number of different possibilities.
Brukner and Zeilinger's attempted distinction between uncertainty before
and after measurement is not to the point: the uncertainty is a function of the
probability distribution and this is perfectly well defined before measurement.[2]

Brukner and Zeilinger have assumed that it is a necessary and sufficient 
condition to understand H as a measure of information that there exists some 
concrete string of N values, for then and only then can we talk of the minimum
number of binary questions needed to specify the string. But as we have now
seen, it is not a necessary condition that there exist such a sequence of outcomes. 

We are not in any case forced to assume that H is about the number of
questions needed to specify a sequence in order to understand it as a measure of
information; we also have the interpretations in terms of the maximum amount
a message drawn from an ensemble described by the probability distribution
p can be compressed, and as the expected information gain on measurement.
(And as we have seen, one of these two interpretations must in fact be prior.)
Furthermore, the absence of a pre-existing string need not even be a problem
for the minimum average questions interpretation: we can ask about the
minimum average number of questions that would be required if we were to
have a sequence drawn from the ensemble. So again, the pre-existence of a
definite string of values is not a necessary condition.

It is not a sufficient condition either, because, faced with a string of N
definite outcomes, in order to interpret NH as the minimum average number
of questions needed to specify the sequence, we need to know that we in fact
have a typical sequence, that is, we need to imagine an ensemble of such typical
sequences and furthermore, to assume that the relative frequencies of each of
the outcomes in our actual string is representative of the probabilities of each
of the outcomes in the notional ensemble from which the sequence is drawn.
If we do not make this assumption, then the minimum number of questions
needed to specify the state of the sequence must be N: we cannot imagine
that the statistical nature of the source from which the sequence is notionally
drawn allows us to compress the message. So even in the classical case, the
concrete sequence on its own is not enough and we need to consider an ensemble,
either of typical sequences or an ensemble from which the concrete sequence is
drawn. In this respect the quantum and classical cases are completely on a
par. The same assumption needs to be made in both cases, namely, that the
probability distribution p, either known in advance, or derived from observed
relative frequencies, correctly describes the probabilities of the different possible
outcomes. The fact that no determinate sequence of outcomes exists before
measurement does not pose any problems for the Shannon information in the
quantum context.

[2] We may need to enter at this point the important note that the Shannon information is
not supposed to describe our general uncertainty when we know the state; this is a job for a
measure of mixedness such as the von Neumann entropy, see below.

Reiterating their requirements for a satisfactory notion of information, Brukner
and Zeilinger say: 

We require that the information gain be directly based on the
observed probabilities, (and not, for example, on the precise sequence
of individual outcomes observed on which Shannon's measure of
information is based). (Brukner and Zeilinger 2000b:1)

But as we have seen, it is false that the Shannon measure must be based on a 
precise sequence of outcomes (this is not a necessary condition) and the Shannon
measure already is and must be based on the observed probabilities (a sequence
of individual outcomes on its own is not sufficient).

There is, however, a difference between the quantum and classical cases
that Brukner and Zeilinger may be attempting to capture. Suppose we have
a sequence of N qubits that has actually been used to encode some information,
that is, the sequence of qubits is a channel to which we have connected
a classical information source. For simplicity, imagine that we have coded in
orthogonal states. Then the state of the sequence of qubits will be a product
of $|0\rangle$'s and $|1\rangle$'s and for measurements in the encoding basis, the sequence
will have a Shannon information equal to NH(A), where H(A) is the information
of the classical source. If we do not measure in the encoding basis,
however, the sequence of 0's and 1's we get as our outcomes will differ from
the values originally encoded and the Shannon information of the resulting
sequence will be greater than that of the original.[3] We have introduced some

[3] We may think of our initial sequence of qubits as forming an ensemble described by the
density operator $\rho = p_1|0\rangle\langle 0| + p_2|1\rangle\langle 1|$, where $p_1, p_2$ are the probabilities for 0 and 1 in our
original classical information source. Any (projective) measurement that does not project

'noise' by measuring in the wrong basis. The way we describe this sort of
situation, though (Schumacher 1995), is to use the Shannon mutual information
$H(A:B) = H(A) - H(A|B)$, where B denotes the outcome of measurement of
the chosen observable (outcomes $b_i$ with probabilities $p(b_i)$) and the 'conditional
entropy' $H(A|B) = \sum_i p(b_i) H(p(a_1|b_i), \ldots, p(a_m|b_i))$ characterises the noise
we have introduced by measuring in the wrong basis. H(B) is the information
(per letter) of the sequence that we are left with after measurement; $H(A:B)$
tells us the amount of information that we have actually managed to transmit
down our channel, i.e. the amount (per letter) that can be decoded when we
measure in the wrong basis.
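
A small numerical model may help fix these quantities. The sketch below assumes a binary source whose bits are encoded as $|0\rangle$ and $|1\rangle$, and a projective measurement in a basis rotated by a Bloch-sphere angle theta away from the encoding basis, so that an outcome differs from the encoded bit with probability $\sin^2(\theta/2)$. The source probabilities, angles and function names are illustrative assumptions.

```python
import math

def shannon_entropy(probs):
    """Shannon information in bits (terms with p_i = 0 contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def joint_distribution(source_probs, theta):
    """p(a, b): bit a is encoded as |a> and measured in a basis rotated by theta."""
    flip = math.sin(theta / 2) ** 2     # chance the outcome b differs from the encoded bit a
    return [[pa * (flip if a != b else 1 - flip) for b in (0, 1)]
            for a, pa in enumerate(source_probs)]

def information_quantities(joint):
    """Return (H(A:B), H(A), H(B)) using H(A:B) = H(A) - H(A|B)."""
    p_a = [sum(row) for row in joint]
    p_b = [sum(column) for column in zip(*joint)]
    h_a_given_b = sum(
        pb * shannon_entropy([joint[a][b] / pb for a in range(len(joint))])
        for b, pb in enumerate(p_b) if pb > 0
    )
    h_a = shannon_entropy(p_a)
    return h_a - h_a_given_b, h_a, shannon_entropy(p_b)

# Hypothetical biased source, assumed only for this illustration.
source = [0.9, 0.1]
for theta in (0.0, math.pi / 4, math.pi / 2):
    mutual, h_a, h_b = information_quantities(joint_distribution(source, theta))
    print(theta, "H(A) =", h_a, "H(B) =", h_b, "H(A:B) =", mutual)
```

In this model H(B) grows with the misalignment while H(A:B) shrinks: the measured sequence has more Shannon information than the source, but less of it is about what was encoded.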

3.2 The grouping axiom 

The first argument has not revealed any difficulties for the Shannon information
in the quantum context, so let us now turn to the second. 

In his original paper, Shannon put forward three properties as reasonable 
requirements on a measure of uncertainty and showed that the only function 
satisfying these requirements has the form $H = -K \sum_i p_i \log p_i$.[4]

The first two requirements are that H should be continuous in the $p_i$ and
that for equiprobable events ($p_i = 1/n$), H should be a monotonic increasing
function of n. The third requirement is the strongest and the most important
in the uniqueness proof. It states that if a choice is broken down into two
successive choices, the original H should be a weighted sum of the individual
values of H. The meaning of this rather non-intuitive constraint is usually
demonstrated with an example (see Fig. 1). A precise statement of Shannon's
third requirement (one that includes also the second requirement as a special
case) is due to Faddeev (1957) and is often known as the Faddeev grouping
axiom:

Grouping Axiom 1 (Faddeev) For every $n \geq 2$

$$H(p_1, p_2, \ldots, p_{n-1}, q_1, q_2) = H(p_1, \ldots, p_{n-1}, p_n) + p_n H\left(\frac{q_1}{p_n}, \frac{q_2}{p_n}\right) \qquad (2)$$

where $p_n = q_1 + q_2$.

The form of the Shannon information follows uniquely from requiring $H(p, 1-p)$
to be continuous for $0 \leq p \leq 1$ and positive for at least one value of p,
permutation invariance of H with respect to relabelling of the $p_i$, and the grouping
axiom.

onto the eigenbasis of $\rho$ will result in a post-measurement ensemble that is more mixed than
$\rho$ (see e.g. Nielsen (2001), Peres (1995) and below) and hence will have a greater uncertainty,
thus a greater Shannon information, or any other measure of information gain.

[4] In contrast to some later writers, however, notably Jaynes (1957), he set little store
by this derivation, seeing the justification of his measure as lying rather in its implications
(Shannon 1948). Save the noiseless coding theorem, the most significant of the implications
that Shannon goes on to draw are, as has been pointed out by Uffink, consequences of the 
property of Schur concavity and hence shared by the general class of measures of uncertainty 
derived in Uffink (1990). 

Figure 1: The probability distribution $p = \{\frac{1}{2}, \frac{1}{3}, \frac{1}{6}\}$ can be considered as
given for the three outcomes directly, or we could consider first a choice of two
equiprobable events, followed by a second choice of two events with probabilities
$\frac{2}{3}, \frac{1}{3}$, conditional on the second, say, of the first two events occurring: a
'decomposition' of a single choice into two successive choices, the latter of which will
only be made half the time. Shannon's third requirement says that the uncertainty
in p will be given by $H(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}) = H(\frac{1}{2}, \frac{1}{2}) + \frac{1}{2} H(\frac{2}{3}, \frac{1}{3})$: the uncertainty of
the overall choice is equal to the uncertainty of the first stage of the choice, plus
the uncertainty of the second choice weighted by its probability of occurrence.

'Grouping axiom' is an appropriate name. As it is standardly understood
(see e.g. Ash (1965), Uffink (1990), Jaynes (1957)), we consider that instead
of giving the probabilities $p_1, \ldots, p_n$ of the outcomes $x_1, \ldots, x_n$ of a probabilistic
experiment directly, we may imagine grouping the outcomes into composite
events (whose probabilities will be given by the sum of the probabilities of their
respective component events), and then specifying the probabilities of the outcome
events conditional on the occurrence of the composite events to which
they belong; this way of specifying the probabilistic experiment being precisely
equivalent to the first. So we might group the first k events together into an
event A, which would have a probability $p(A) = \sum_{i=1}^{k} p_i$, and the remaining
$n - k$ into an event B of probability $p(B) = \sum_{i=k+1}^{n} p_i$, and then give the
conditional probabilities of the events $x_1, \ldots, x_k$ conditional on composite event A
occurring, $(p_1/p(A)), \ldots, (p_k/p(A))$, and similarly the conditional probabilities
for the events $x_{k+1}, \ldots, x_n$ conditional on event B. The grouping axiom then
concerns how the uncertainty measures should be related for these different
descriptions of the same probabilistic experiment. It says that our uncertainty
about which event will occur should be equal to our uncertainty about which
group it will belong to plus the expected value of the uncertainty that would
remain if we were to know which group it belonged to (this expected value being
the weighted sum of the uncertainties of the conditional distributions, with
weights given by the probability of the outcome lying within a given group).

So in particular, let us imagine an experiment with $n + 1$ outcomes which we
label $a_1, a_2, \ldots, a_{n-1}, b_1, b_2$, having probabilities $p_1, \ldots, p_{n-1}, q_1, q_2$ respectively.
We can define an event $a_n = b_1 \cup b_2$, $b_1 \cap b_2 = \emptyset$, which would have probability
$p_n = q_1 + q_2$, and the probabilities for $b_1$ and $b_2$ conditional on $a_n$ occurring will
then be $q_1/p_n$ and $q_2/p_n$ respectively. Grouping Axiom 1 says that the uncertainty in the
occurrence of events $a_1, a_2, \ldots, a_{n-1}, b_1, b_2$ is equal to the uncertainty for the
occurrence of events $a_1, \ldots, a_n$ plus the uncertainty for the occurrence of $b_1, b_2$
conditional on $a_n$ occurring, weighted by the probability that $a_n$ should occur.
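
The standard reading of the grouping axiom can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names `shannon` and `grouped` are illustrative, and logarithms are taken to base 2. It verifies Shannon's three-outcome example and one further grouping.

```python
from math import log2, isclose

def shannon(probs):
    """Shannon information H(p) in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def grouped(probs, k):
    """Uncertainty computed in two stages, as the grouping axiom prescribes.

    The first k outcomes form composite event A, the remainder form B.
    Returns H(p(A), p(B)) + p(A) * H(p_i / p(A)) + p(B) * H(p_i / p(B)).
    """
    p_a = sum(probs[:k])
    p_b = sum(probs[k:])
    h_groups = shannon([p_a, p_b])
    h_within_a = shannon([p / p_a for p in probs[:k]]) if p_a > 0 else 0.0
    h_within_b = shannon([p / p_b for p in probs[k:]]) if p_b > 0 else 0.0
    return h_groups + p_a * h_within_a + p_b * h_within_b

# Shannon's example: H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)
p = [1 / 2, 1 / 3, 1 / 6]
direct = shannon(p)
two_stage = shannon([1 / 2, 1 / 2]) + 0.5 * shannon([2 / 3, 1 / 3])
print(isclose(direct, two_stage), isclose(direct, grouped(p, 1)))
```

Nothing in this check refers to a joint experiment: only a single probability distribution and a partition of its outcomes are involved.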

Brukner and Zeilinger suggest that the grouping axiom, however, embodies 
certain classical presumptions that do not apply in quantum mechanics. This 
entails that the axiomatic derivation of the form of the Shannon measure does
not go through and that the Shannon information ceases to be a measure of
uncertainty in the quantum context. The argument turns on their interpretation
of the grouping axiom, which differs from the standard interpretation in that it
refers to joint experiments. 

3.2.1 Brukner and Zeilinger's interpretation 

If we take an experiment, $A$, with outcomes $a_1, \ldots, a_n$ and probabilities
$(p(a_1), \ldots, p(a_n)) = (p_1, \ldots, p_n)$ and an experiment, $B$, with outcomes $b_1, b_2$,
then for the joint experiment $A \wedge B$, the event $a_n$ is the union of the two
disjoint events $a_n \wedge b_1$ and $a_n \wedge b_2$. Let us assign to these two events the
probabilities $q_1$ and $q_2$ respectively. Then $p(a_n) = p(a_n \wedge b_1) + p(a_n \wedge b_2) =
q_1 + q_2 = p_n$. On this interpretation, the left hand side of Grouping Axiom 1 is
to be understood as denoting the uncertainty in the experiment with outcomes
$a_1, a_2, \ldots, a_{n-1}, a_n \wedge b_1, a_n \wedge b_2$.

If $a_n$ occurs, the conditional probabilities for $b_1, b_2$ will be $p(a_n \wedge b_1)/p(a_n) =
q_1/p_n$ and $p(a_n \wedge b_2)/p(a_n) = q_2/p_n$ respectively, and so $H(q_1/p_n, q_2/p_n)$ is the
uncertainty in the value of $B$ given that $a_n$ occurs.

The grouping axiom can now be rewritten as: 

Grouping Axiom 2 (Brukner and Zeilinger) 

$$H(p(a_1), p(a_2), \ldots, p(a_{n-1}), p(a_n \wedge b_1), p(a_n \wedge b_2))
  = H(p(a_1), p(a_2), \ldots, p(a_n)) + p(a_n) H(p(b_1|a_n), p(b_2|a_n)). \quad (3)$$

Generalizing to the case in which we have $m$ outcomes for experiment $B$ and
distinguish $B$ values for all $n$ $A$ outcomes, so that we have $mn$ outcomes $a_i \wedge b_j$,
the grouping axiom becomes:

Grouping Axiom 3 (Brukner and Zeilinger) 

$$H(A \wedge B) = H(A) + H(B|A)$$

From the point of view of Shannon's original presentation, this expression ap- 
pears as a theorem rather than an axiom, being a consequence of the logarithmic 
form of the Shannon information and the definition of the conditional entropy.
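
That this expression holds as a theorem whenever a joint distribution is given can be illustrated with a short check. This is a sketch only: the joint distribution below is a hypothetical example chosen for illustration, and the conditional entropy is computed from its definition as the weighted sum of the entropies of the conditional distributions.

```python
from math import log2, isclose

def shannon(probs):
    """Shannon information H(p) in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(a_i and b_j):
# rows index the outcomes of A, columns the outcomes of B.
joint = [
    [0.10, 0.20],
    [0.25, 0.05],
    [0.30, 0.10],
]

# Marginal distribution for A and the three entropies in the relation.
p_a = [sum(row) for row in joint]
h_joint = shannon([p for row in joint for p in row])
h_a = shannon(p_a)
# H(B|A) = sum_i p(a_i) H(p(b_1|a_i), ..., p(b_m|a_i))
h_b_given_a = sum(
    pa * shannon([p / pa for p in row]) for pa, row in zip(p_a, joint)
)

print(isclose(h_joint, h_a + h_b_given_a))
```

The check presupposes that the table `joint` exists; it says nothing about cases in which no joint distribution can be defined.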

3.2.2 The inapplicability argument 

The classical assumptions made explicit, Brukner and Zeilinger suggest, in
Grouping Axioms 2 and 3 are that attributes corresponding to all possible
measurements can be assigned to a system simultaneously (in this case, $a_i$, $b_j$ and
$a_i \wedge b_j$); and that measurements can be made ideally non-disturbing. Grouping
Axiom 3, for example, is supposed to express the fact that classically, the
information we expect to gain from a joint experiment A A B, is the same as 
the information we expect to gain from first performing A, then performing B 
(where the uncertainty in B is updated conditional on the A outcome, but our 
ability to predict B outcomes is not degraded by the A measurement). 

Their inapplicability argument is simply that as the grouping axiom requires 
us to consider joint experiments, the uniqueness proof for the Shannon informa- 
tion will fail in the quantum context, because we can consider measurements of 
non-commuting observables and the joint probabilities on the left hand side of 
Grouping Axiom 2 will not be defined for such observables; thus the grouping
axiom will fail to hold. Furthermore, the grouping axiom shows that the Shan- 
non information embodies classical assumptions, so the Shannon measure will 
not be justified as a measure of uncertainty because these assumptions do not 
hold in the quantum case. The result is that 

...only for the special case of commuting, i.e., simultaneously definite
observables, is the Shannon measure of information applicable and
the use of the Shannon information justified to define the uncertainty
given before quantum measurements are performed. (Brukner and
Zeilinger 2001:4, my emphasis)

This argument is problematic, however. Let us begin with the obvious point
that a failure of the argument for uniqueness does not automatically rule out 
the Shannon information as a measure of uncertainty. In fact, the Shannon 
information can be seen as one of a general class of measures of uncertainty, 
characterised by a set of axioms in which the grouping axiom does not appear 
(Uffink 1990), hence the grouping axiom is not necessary for the interpretation 
of the Shannon information as a measure of uncertainty. (Uffink in fact has 
previously argued that the grouping axiom is not a natural constraint on a 
measure of information and should not be imposed as a necessary constraint, 
even in the classical case (Uffink 1990, §1.6.3).) So from the fact that on the 
Brukner/Zeilinger reading, the grouping axiom seems to embody some classical
assumptions that do not hold in the quantum case, it does not follow that the 
concept of the Shannon information as a measure of uncertainty involves those 
classical assumptions. 

Furthermore, Brukner and Zeilinger's grouping axiom is not in fact equivalent
to the standard form and the standard form is equally applicable in both
the classical and quantum cases. Thus the Shannon information has not been
shown to involve classical assumptions and the standard axiomatic derivation
can indeed go through in the quantum context. The probabilities appearing in
Grouping Axiom 1 are well defined in both the classical and quantum cases.

In Brukner and Zeilinger's notation, Grouping Axiom 1 would be written as

$$H(p(a_1), p(a_2), \ldots, p(a_{n-1}), p(b_1), p(b_2))
  = H(p(a_1), \ldots, p(a_{n-1}), p(b_1 \vee b_2))
  + p(b_1 \vee b_2) H(p(b_1|b_1 \vee b_2), p(b_2|b_1 \vee b_2)). \quad (4)$$

Figure 2: a) in Grouping Axiom 1 and eqn. (4) the 'B' outcomes $b_1$ and $b_2$
cannot occur without $a_n = b_1 \cup b_2$ occurring. b) in the joint experiment
scenario, $b_1$ or $b_2$ can occur without $a_n$ occurring, but this is made to appear
the same by coarse graining the joint experiment and only recording $B$ values
when we get $A$ outcome $a_n$.

This refers to an experiment with $n + 1$ outcomes labelled by $a_1, \ldots, a_{n-1}, b_1, b_2$
and the grouping of two of these outcomes together, and is clearly different
from Grouping Axiom 2 (see Fig. 2). Brukner and Zeilinger's Grouping Axiom
2 is the result of applying eqn. (4) to a coarse grained joint experiment in which
we only distinguish the $B$ outcomes of $A \wedge B$ for $A$ outcome $a_n$, and Grouping
Axiom 3 is the result of applying the simple grouping axiom to the fine grained
joint experiment $n$ times. These are not, then, expressions of the grouping
axiom, but rather demonstrate its effect when actually applied to an already well
defined joint probability distribution. The absence of certain joint probability
distributions in quantum mechanics does not, however, affect the meaningfulness
of the grouping axiom, because in its proper formulation it does not refer to joint
experiments 5. (Note that if the outcomes of the experiment in eqn. (4) were
represented by one dimensional projection operators, the event $b_1 \vee b_2$ would
be represented by a sum of orthogonal projectors which commutes with the
remaining projectors; a similar relation (coexistence) holds if the outcomes are
represented by POV elements (effects), Busch et al. (1996).)

Thus we see that the explicit argument fails. There may remain, however,
a certain intuitive one. Brukner and Zeilinger are perhaps suggesting that we
miss something importantly quantum by using the Shannon information as its
derivation is restricted to the case of commuting observables. Or rather (since we
have seen that the grouping axiom does not explicitly refer to joint experiments
and we know that it is not in fact necessary for the functioning of the Shannon
information as a measure of uncertainty anyway), because it can only tell us
about the uncertainty in joint experiments for mutually commuting observables
rather than the full set of observables. However, when we recall that a measure
of uncertainty is a measure of the spread of a probability distribution, we see
that this simply amounts to the truism that one cannot have a measure of
the spread of a joint probability distribution unless one actually has a joint
probability distribution, and it in no way implies that there is anything
un-quantum about the Shannon information itself.

5 To see that the standard case and the joint experiment case are mathematically distinct,
note that the joint experiment formalism cannot express the situation in which B events only
happen if the $a_n$ event occurs. For that we would require that $p(a_n) = p(b_1) + p(b_2)$; but then
the marginal distribution for B outcomes in the joint experiment does not sum to unity as is
required for a well defined joint experiment, $\sum_i p(b_i) = p(a_n) \neq 1$ in general.

The fact that joint probability distributions cannot be defined for all possible
groupings of experiments that we might consider does not tell us anything
about whether a certain quantity is a good or bad measure of uncertainty for 
probability distributions that can be defined (e.g. any probability distribution 
derived from a quantum state by the trace rule, or conditional probabilities 
given by the Lüders rule). We must be careful not to confuse the question of 
what makes a good measure of uncertainty with the question of when a joint 
probability distribution can be defined. 

We already know that a function of a joint probability distribution cannot 
be a way of telling us how much we know, or how uncertain we are, in general
when we know the state of a system, because we know that a joint probability 
distribution for all possible measurements does not exist. It is for this reason in 
quantum mechanics that we introduce measures of mixedness such as the von 
Neumann entropy, which are functions of the state rather than of a probability 
distribution. It is not a failing of the Shannon information as a measure of 
uncertainty or expected information gain that it does not fulfil the same role. 
Part of Brukner and Zeilinger's worry about the Shannon information thus 
seems to arise because they are trying to treat it too much like a measure of 
mixedness, a measure of how uncertain we are in general when we know the 
state of a quantum system 6 . 

6 This is illustrated for example in their reply to criticism of their grouping axiom
argument by Hall (Brukner and Zeilinger 2000b, Hall 2000). Hall presents an interpretation
of the grouping axiom concerning the increase in randomness on mixing of non-overlapping
distributions, to which Brukner and Zeilinger's worries about joint experiments would not
apply. Their reply, in essence, is that the density matrix cannot be simultaneously diagonal
in non-commuting bases, therefore it cannot be thought to be composed of non-overlapping 
classical distributions, hence Hall's grouping axiom will not apply, further supporting their 
original claim that the Shannon measure is tied to the notion of classical properties. What 
this reply in fact establishes, however, is that Hall's axiom applied to mixtures of classical 
distributions is not relevant to characterising the randomness of the density matrix; but this 
is something with which everyone would agree, and this job certainly not one for which the 
Shannon information is intended. (If we did wish to use the grouping axiom in characterising 
the randomness of the density matrix, we would apply Hall's version to mixtures of density 
operators with orthogonal support; this would then pick out the von Neumann entropy up to 
a constant factor (Wehrl 1978).) 

4 Brukner and Zeilinger's 'Total information content'



The final argument proposed against the Shannon information is that it is not 
appropriately related to a notion of 'total information content' for a quantum 
system. It is also suggested that the von Neumann entropy, which has a natural 
relation to the Shannon information, is not a measure of information content as 
it makes no explicit reference to information gain from measurements in general
(Brukner and Zeilinger 2001, 2000b). 

In place of the Shannon information, Brukner and Zeilinger propose the
quantity (with $\mathcal{N}$ a normalisation constant)

$$I(p) = \mathcal{N} \sum_{i=1}^{n} \left(p_i - \frac{1}{n}\right)^2, \quad (5)$$

from which they derive their notion of total information content as follows: 

A set of measurements is called mutually unbiased or complementary
(Schwinger 1960) if the sets of projectors $\{P\}, \{Q\}$ associated with any pair
of measurement bases satisfy $\mathrm{Tr}(PQ) = 1/n$, where $n$ is the dimensionality of
the system. There can exist at most $n + 1$ such bases (Wootters and Fields 1989),
constituting a complete set, and as was shown by Ivanovic (1981), measurement
of such a complete set on an ensemble of similarly prepared systems determines
their density matrix $\rho$ completely. In analogy to acquiring the information
associated with a (pointlike) classical system by learning its state (determining
its position in phase space), Brukner and Zeilinger then suggest that the total
information content of a quantum system should be given by a sum of information
measures for a complete set of mutually unbiased measurements. Taking
$I(p)$ as the measure of information, the result is a unitarily invariant quantity:

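$$I_{tot} = \sum_{j=1}^{n+1} I(p^j) = \sum_{j,i} \left(p_i^j - \frac{1}{n}\right)^2 = \mathrm{Tr}\left(\rho - \frac{\mathbb{1}}{n}\right)^2, \quad (6)$$

where $p^j$ is the outcome distribution for the $j$th mutually unbiased measurement and the
second equality takes the normalisation constant of eqn. (5) to be 1.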

This also provides their argument against the Shannon information. It is
a necessary constraint on a measure of total information content, they argue,
that it be unitarily invariant, but substituting $H(p)$ for $I(p)$ in (6) does not
result in a unitarily invariant quantity, that is, we do not have a sum to a
'total information content'. Let us call the requirement that a measure sum to a
unitarily invariant quantity for a complete set of mutually unbiased
measurements the 'total information constraint'. The suggestion is that the Shannon
measure is inadequate as a measure of information gain because it does not
satisfy the total information constraint and hence does not tell us how much of
the total information content of a system we learn by performing measurements
in a given basis. Similarly, the complaint against the von Neumann entropy is
that it is merely a measure of mixedness, as unlike $I_{tot}$, it has no relation to
the information gained in a measurement unless we happen to measure in the
eigenbasis of $\rho$.
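
The contrast Brukner and Zeilinger draw can be made concrete for a single qubit. The sketch below rests on assumptions that are not in the text: the state is represented by its Bloch vector $r$, the complete set of mutually unbiased measurements is taken to be spin along x, y and z (so the outcome probabilities are $(1 \pm r_j)/2$), and the normalisation of $I(p)$ is set to 1. Two pure states related by a rotation are compared.

```python
from math import log2, sqrt, isclose

def shannon(probs):
    """Shannon information H(p) in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def bz_info(probs):
    """I(p) with the normalisation constant set to 1: sum_i (p_i - 1/n)^2."""
    n = len(probs)
    return sum((p - 1 / n) ** 2 for p in probs)

def mub_distributions(r):
    """Outcome distributions for spin measurements along x, y, z
    on a qubit with Bloch vector r (an assumed parametrisation)."""
    return [[(1 + c) / 2, (1 - c) / 2] for c in r]

def totals(r):
    """Sum of I(p) and sum of H(p) over the three mutually unbiased bases."""
    dists = mub_distributions(r)
    return sum(bz_info(p) for p in dists), sum(shannon(p) for p in dists)

# Two pure states (unit Bloch vectors), related by a rotation, i.e. a unitary.
r1 = (1.0, 0.0, 0.0)
s = 1 / sqrt(3)
r2 = (s, s, s)

i1, h1 = totals(r1)
i2, h2 = totals(r2)
print(i1, i2)  # the summed I(p) agree
print(h1, h2)  # the summed H(p) differ
```

For both states the summed $I(p)$ equals $|r|^2/2$, which is $\mathrm{Tr}(\rho - \mathbb{1}/2)^2$ for a qubit, whereas the summed Shannon informations depend on the orientation of the state relative to the measurement bases.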

A few remarks are in order. First, $I(p)$ and $I_{tot}$ are not unfamiliar expressions.
The quantity $\sum_i (p_i - 1/n)^2$ is one of the class of measures of the
concentration of a probability distribution given by Uffink (1990), and Fano (1957),
for example, remarks that $\mathrm{Tr}(\rho^2)$ can serve as a good measure of information;
furthermore, the relation expressed in eqn. (6) has previously been employed by
Larsen (1990) in discussing exact uncertainty relations. Note also that $I(p)$ is
an increasing function of the concentration of a probability distribution, hence
a measure of how much we know given a probability distribution, rather than
being a measure of uncertainty like $H(p)$; similarly $I_{tot}$ is an increasing function
of the purity of $\rho$.

More importantly, however, 'information content' might mean several different
things. It may not, then, be reasonable to require that every meaningful
information measure sum to a unitarily invariant quantity that can be inter- 
preted as an information content. Moreover, we may well ask why information 
measures for a complete set of mutually unbiased measurements should be ex- 
pected to sum to any particularly interesting quantity. That the measure I(p) 
happens to sum to a unitarily invariant quantity is, as we shall presently see, 
the consequence of a geometric property tangential to its role as a measure of
information. 

4.1 Some Different Notions of Information Content 

It is useful to distinguish between the information encoded in a system, the 
information required to specify the state of a system (more precisely, the infor- 
mation required to specify a sequence of states drawn from a given ensemble) 
and states of complete and less than complete knowledge or information. Each 
of these can serve as a notion of information content in an appropriate context. 
In the classical case, their differences can be largely ignored, but in the quantum 
case there are important divergences. It is, for instance, necessary to introduce 
the concept of the accessible information to characterise the difference between
information encoded and specification information (Schumacher 1995). 

If we consider encoding the outputs of a classical information source $A$
into pure states $|a_i\rangle$ of an ensemble of quantum systems, then the state of the
ensemble will be given by $\rho = \sum_i p(a_i) |a_i\rangle\langle a_i|$. The von Neumann entropy,
$S(\rho) = -\mathrm{Tr}\,\rho \log \rho$, is a measure of how mixed this state is, giving us one sense
of information content: the more mixed a state, the less information we have
about what the outcome of measurements on systems described by the state
will be 7.

If we are presented with a sequence of systems drawn from an ensemble
prepared in this manner, each will be in one of the pure states, and the number
of bits per system required to specify this sequence will be $H(A)$, the information
of the classical source (which will be greater than $S(\rho)$ unless we have coded
in orthogonal states). This is the specification information, also the amount of
information required to prepare the sequence. For the amount of information
that has actually been encoded into the systems, however, we need to consider
measurements on the ensemble and the Shannon mutual information $H(A:B)$.

7 Mixed states are also sometimes said to be states of less than complete information due
to a lack of information about the way a system was prepared, represented by a probability
distribution over possible pure states. Our reading is to be preferred given the many-one
relation of preparation procedures to density operators and the fact that density operators
can also result from tracing out unwanted degrees of freedom.

As already remarked in an earlier section, if we encode using a certain basis (our
$|a_i\rangle$ form an orthonormal set) and we measure in a different basis, then
$H(A:B) < H(A)$; quantum 'noise' reduces the amount we can decode. More
significantly, if we have coded in non-orthogonal states (if, for example, the number of
outputs of our classical source is greater than the dimensionality of our
quantum systems), then no measurement can distinguish these states perfectly and
we cannot recover the complete classical information. Taking 'encoded' to be a
'success' word (something cannot be said to have been encoded if it cannot in
principle be decoded), then the maximum amount of information encoded in a
system is given by the accessible information, the maximum over all decoding
observables of the mutual information. Using non-orthogonal coding states, the
amount we can encode is less than the specification information. A well known
result due to Holevo (1973) provides an upper bound on the mutual information
resulting from measurement of any observable (including POV measurements).
For the general case of encoding states $\rho_{a_i}$, this is given by

$$H(A:B) \leq S(\rho) - \sum_i p(a_i) S(\rho_{a_i}),$$

which in the case we are considering of pure encoding states, reduces to
$H(A:B) \leq S(\rho)$. This provides a very strong sense in which the von Neumann entropy
does give us a notion of the total information content of a quantum system:
it is the maximum amount that can actually be encoded in the system.
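
The gap between specification information and the Holevo bound can be illustrated with the simplest non-orthogonal code. This is a sketch under stated assumptions: a two-letter source is encoded into two pure qubit states with a given overlap, and the helper `entropy_two_pure_states` (an illustrative name) obtains the eigenvalues of $\rho$ from its trace and determinant.

```python
from math import log2, sqrt, isclose

def shannon(probs):
    """Shannon information H(p) in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def entropy_two_pure_states(overlap, p=0.5):
    """Von Neumann entropy (bits) of rho = p|a><a| + (1-p)|b><b|,
    where overlap = |<a|b>|.

    rho has rank at most two, unit trace and determinant
    p(1-p)(1 - overlap^2) on the span of the two states, so its
    eigenvalues are (1 +/- sqrt(1 - 4 det)) / 2.
    """
    disc = sqrt(1 - 4 * p * (1 - p) * (1 - overlap ** 2))
    return shannon([(1 + disc) / 2, (1 - disc) / 2])

# Specification information of an equiprobable two-letter source: 1 bit.
h_source = shannon([0.5, 0.5])

# Orthogonal coding states: S(rho) equals H(A).
s_orthogonal = entropy_two_pure_states(0.0)

# Non-orthogonal coding states, e.g. |0> and |+>, with overlap 1/sqrt(2).
s_nonorthogonal = entropy_two_pure_states(1 / sqrt(2))

print(h_source, s_orthogonal, s_nonorthogonal)
```

With the non-orthogonal pair, $S(\rho)$ comes out at roughly 0.6 bits, so by the Holevo bound no measurement can recover the full bit needed to specify each state.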

Brukner and Zeilinger do not consider a quantum communication channel
but are concerned rather with the information content of a single system con- 
sidered in isolation. This information content is supposed to relate to how much 
we learn from learning the state, but if the system is being treated in isolation 
then by learning its state we are not acquiring a certain amount of information 
in virtue of the state being drawn from a given ensemble, as in the standard
notion of information. (Hence their analogy with gaining the information content
of a classical system fails to hold.) In fact, their 'total information content'
seems best interpreted as a measure of mixedness analogous to the von Neumann 
entropy. 

When introduced (Brukner and Zeilinger 1999), the information measure
I(p) is presented as a measure of how much we know about what the outcome 
of a particular experiment will be, given the state. The total information of the 
state, then, would seem to be a measure of how much we know in general about 
what the outcomes of experiments will be given the state; and this is precisely 
a question of the degree of mixedness of the state 8 . 

8 Recently it has been noted that $I_{tot}$ is also related to the average distance of our estimate
of the unknown state from the true state (measured in the Hilbert-Schmidt norm), given

4.1.1 Measures of mixedness

The functioning of measures of mixedness can usefully be approached via the
notions of majorization and Schur convexity (concavity). The majorization
relation $\prec$ imposes a pre-order on probability distributions (Uffink 1990, Nielsen
2001). A probability distribution $q$ is majorized by $p$, $q \prec p$, iff $q_i = \sum_j S_{ij} p_j$,
where $S_{ij}$ is a doubly stochastic matrix. That is (via Birkhoff's theorem), if $q$
is a mixture of permutations of $p$. Thus if $q \prec p$, then $q$ is a more mixed or
disordered distribution than $p$.
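
The majorization relation is easy to test in practice. The sketch below uses the standard equivalent characterisation in terms of partial sums of the decreasingly ordered vectors, which is not stated in the text; the matrix `S` and the function names are illustrative choices.

```python
from itertools import accumulate

def is_majorized_by(q, p, tol=1e-12):
    """True if q is majorized by p: with both vectors sorted in decreasing
    order, no partial sum of q exceeds the matching partial sum of p."""
    if len(q) != len(p) or abs(sum(q) - sum(p)) > tol:
        return False
    q_sums = accumulate(sorted(q, reverse=True))
    p_sums = accumulate(sorted(p, reverse=True))
    return all(a <= b + tol for a, b in zip(q_sums, p_sums))

def apply_matrix(s, p):
    """Return q with q_i = sum_j S_ij p_j."""
    return [sum(s_ij * p_j for s_ij, p_j in zip(row, p)) for row in s]

p = [0.7, 0.2, 0.1]

# A doubly stochastic matrix: every row and every column sums to one.
S = [
    [0.50, 0.50, 0.00],
    [0.50, 0.25, 0.25],
    [0.00, 0.25, 0.75],
]

q = apply_matrix(S, p)
print(q, is_majorized_by(q, p), is_majorized_by(p, q))
```

Here `q` is more mixed than `p`, so `q` is majorized by `p` but not conversely; the uniform distribution is majorized by every distribution of the same length.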

Schur convex (concave) functions respect the ordering of the majorization
relation: a function $f$ is Schur convex if, if $q \prec p$ then $f(q) \leq f(p)$, and Schur
concave if, if $q \prec p$ then $f(q) \geq f(p)$ (for strictly Schur convex (concave) functions,
equality holds only if $q$ and $p$ are permutations of one another). This explains
the utility of such functions as measures of the concentration and uncertainty
of probability distributions, respectively.

The majorization relation will apply equally to the vectors of eigenvalues of
density matrices. It can be shown that the vector of eigenvalues $\lambda'$ of the density
matrix $\rho'$ of the post measurement ensemble for a (non-selective) projective
measurement is majorized by the vector of eigenvalues $\lambda$ of the pre-measurement
state $\rho$ (Nielsen 2001). (If we measure in the eigenbasis of $\rho$, then there is,
of course, no change in the eigenvalues.) The $\lambda'_i$ are just the probabilities of
the different outcomes of the measurement in question, thus the probability
distribution for the outcomes of any given measurement will be more disordered
or spread than the eigenvalues of $\rho$.

If we take any Schur concave function we know to be a measure of
uncertainty, for instance the Shannon information $H(p)$, and $p$ is the probability
distribution for measurement outcomes, we then know that $H(p) \geq H(\lambda)$, for
any projective measurement we might perform. This explains why $H(\lambda) = S(\rho)$,
a measure of mixedness, is a measure of how much we know given the state:
the more mixed a state, the more uncertain we must be about the outcome
of any given measurement. Similarly, if we take a measure of the concentration
of a probability distribution, a Schur convex function such as $I(p)$, then
we know that for any measurement with outcome probability distribution $p$,
$I(p) \leq I(\lambda) = I_{tot}$; and this explains why $I_{tot}$ is a measure of how much we
know given $\rho$: the less the value of $I_{tot}$, the less able we are to predict the
outcome of any given experiment.
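
Both bounds can be seen at work in a qubit example. The sketch assumes a state with eigenvalues (0.9, 0.1) and a projective measurement whose basis is rotated by an angle `theta` from the eigenbasis, so that the outcome probabilities are obtained from the eigenvalues by a doubly stochastic matrix with entries $\cos^2\theta$ and $\sin^2\theta$; the normalisation of $I(p)$ is again set to 1.

```python
from math import cos, sin, log2, pi

def shannon(probs):
    """Shannon information H(p) in bits; outcomes with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

def bz_info(probs):
    """I(p) with the normalisation constant set to 1: sum_i (p_i - 1/n)^2."""
    n = len(probs)
    return sum((p - 1 / n) ** 2 for p in probs)

def outcome_distribution(eigs, theta):
    """Outcome probabilities for a projective qubit measurement in a basis
    rotated by theta from the eigenbasis of the state."""
    l1, l2 = eigs
    c2, s2 = cos(theta) ** 2, sin(theta) ** 2
    return [l1 * c2 + l2 * s2, l1 * s2 + l2 * c2]

eigs = [0.9, 0.1]                      # eigenvalues of the assumed state
angles = [k * pi / 16 for k in range(9)]
dists = [outcome_distribution(eigs, t) for t in angles]

# S(rho) = H(eigenvalues) bounds H(p) from below;
# I(eigenvalues) bounds I(p) from above.
entropy_bound_holds = all(shannon(p) >= shannon(eigs) - 1e-12 for p in dists)
info_bound_holds = all(bz_info(p) <= bz_info(eigs) + 1e-12 for p in dists)
print(entropy_bound_holds, info_bound_holds)
```

Equality is attained at `theta = 0`, the eigenbasis measurement, and the outcome distribution becomes uniform at `theta = pi / 4`.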

Brukner and Zeilinger would of course deny that their total information
content is merely a measure of mixedness. The argument that it is more than
this rests on the satisfaction of the total information constraint, the relation
between the measure of information $I(p)$ and $I_{tot}$ for a complete set of mutually
unbiased measurements as expressed in eqn. (6). We shall now see that this
relation can be given a simple geometric explanation using the Hilbert-Schmidt
representation of density operators.

only a finite number N of experiments in each mutually unbiased basis (Rehacek and Hradil
2002). This seems best understood as indicating that the mean error in our state estimation
is inversely related to N, with a constant of proportionality that depends on the dimension of
the system and the mixedness of the state. In any case, Brukner and Zeilinger are primarily
interested in how much we know when the state has been determined to arbitrary accuracy.



4.2 The Relation between Total Information Content and I(p)

The set of complex $n \times n$ Hermitian matrices forms an $n^2$-dimensional real
Hilbert space $V_h(\mathbb{C}^n)$ on which we have defined an inner product $(A, B) =
\mathrm{Tr}(AB)$; $A, B \in V_h(\mathbb{C}^n)$ and a norm $\|A\| = \sqrt{\mathrm{Tr}(A^2)}$ (Fano 1957; Wichmann
1963). The density matrix $\rho$ of an $n$ dimensional quantum system can be
represented as a vector in this space. The requirements on $\rho$ of unit trace and
positivity imply that the tip of any such vector must lie in the $n^2 - 1$
dimensional hyperplane $T$ a distance $1/\sqrt{n}$ from the origin and perpendicular to the
unit operator $\mathbb{1}$, and on or within a hypersphere of radius one centred on the
origin.

It is useful to introduce a set of basis operators on our space; we require
$n^2$ linearly independent operators $U_i \in V_h(\mathbb{C}^n)$ and it may be useful to require
orthogonality: $\mathrm{Tr}(U_i U_j) = \mathrm{const.} \times \delta_{ij}$. Any operator on the system can then
be expanded in terms of this basis and in particular, $\rho$ can be written as

$$\rho = \frac{\mathbb{1}}{n} + \sum_{i=1}^{n^2-1} c_i U_i,$$

where we have chosen $U_0 = \mathbb{1}$ to take care of the trace condition and, for an
orthogonal basis, each coefficient $c_i$ is proportional to the expectation value $\mathrm{Tr}(\rho U_i)$.

Evidently, $\rho$ may be determined experimentally by finding the expectation
values of the $n^2 - 1$ operators $U_i$ in the state $\rho$. If we include the operator $\mathbb{1}$ in
our basis set, then the idempotent projectors associated with measurement of
any maximal (non-degenerate) observable will provide a maximum of a further
$n - 1$ linearly independent operators. Obtaining the probability distribution for
a given maximal observable will thus provide $n - 1$ of the parameters required
to determine the state, and the minimum number of measurements of maximal
observables that will be needed in total is $n + 1$, if each observable provides a
full complement of linearly independent projectors.

Each such set of projectors spans an $n - 1$ dimensional hyperplane in $V_h(\mathbb{C}^n)$
and their expectation values specify the projection of the state $\rho$ into this
hyperplane. Ivanovic (1981) noted that projectors $P$, $Q$ belonging to any two different
mutually unbiased bases will be orthogonal in $T$, hence the hyperplanes
associated with measurement of mutually unbiased observables are orthogonal in the
space in which density operators are constrained to lie in virtue of the trace
condition. If $n + 1$ mutually unbiased observables can be found, then, $V_h(\mathbb{C}^n)$
can be decomposed into orthogonal subspaces given by the one dimensional
subspace spanned by $\mathbb{1}$ and the $n + 1$ subspaces associated with the mutually
unbiased observables. The state $\rho$ can then be expressed as:

$$\rho = \frac{\mathbb{1}}{n} + \sum_{j=1}^{n+1} \sum_{i=1}^{n} q_i^j \bar{P}_i^j,$$

where $\bar{P}_i^j = P_i^j - \mathbb{1}/n$ is the projection onto $T$ of the $i$th idempotent projector in
the $j$th mutually unbiased basis set, and $q_i^j = (p_i^j - 1/n)$ is the expectation value
of this operator in the state $\rho$. For a given value of $j$, the vectors $\bar{P}_i^j$ span an
$(n - 1)$ dimensional orthogonal subspace and the square of the length of a vector
expressed in the form $\sum_i q_i^j \bar{P}_i^j$ lying in subspace $j$ will be given by
$\sum_{i=1}^{n} (q_i^j)^2 = I(p^j)$.

The geometrical explanation of $I_{tot}$ is then simply as follows. $\mathrm{Tr}(\rho^2)$ is the
square of the length of $\rho$ in $V_h(\mathbb{C}^n)$ and $\mathrm{Tr}(\rho - \mathbb{1}/n)^2$ is the square of the distance
of $\rho$ from the maximally mixed state (the length squared of $\rho$ in $T$). This squared
length will just be the sum of the squares of the lengths of the components of
the vector $\rho - \mathbb{1}/n$ in the orthogonal subspaces into which we have decomposed
$V_h(\mathbb{C}^n)$, i.e. it will be given by $\sum_{j,i} (q_i^j)^2$. This is what eqn. (6) reports and it
explains how $I_{tot}$ and $I(p)$ satisfy the total information constraint.
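
The orthogonality on which this explanation rests can be spelled out in a few lines. The following worked derivation is a sketch, assuming only that projectors within one basis satisfy $\mathrm{Tr}(P_i^j P_k^j) = \delta_{ik}$, that projectors from different mutually unbiased bases satisfy $\mathrm{Tr}(P_i^j P_k^l) = 1/n$, and that $\rho - \mathbb{1}/n$ is expanded in the traceless operators $\bar{P}_i^j = P_i^j - \mathbb{1}/n$ with coefficients $q_i^j = p_i^j - 1/n$.

```latex
% Different bases (j \neq l): the traceless parts are orthogonal.
\mathrm{Tr}\bigl(\bar{P}_i^j \bar{P}_k^l\bigr)
  = \mathrm{Tr}\bigl(P_i^j P_k^l\bigr)
    - \tfrac{1}{n}\mathrm{Tr}\bigl(P_i^j\bigr)
    - \tfrac{1}{n}\mathrm{Tr}\bigl(P_k^l\bigr)
    + \tfrac{1}{n^2}\mathrm{Tr}(\mathbb{1})
  = \tfrac{1}{n} - \tfrac{1}{n} - \tfrac{1}{n} + \tfrac{1}{n} = 0 .

% Same basis: the same expansion with Tr(P_i^j P_k^j) = \delta_{ik}.
\mathrm{Tr}\bigl(\bar{P}_i^j \bar{P}_k^j\bigr) = \delta_{ik} - \tfrac{1}{n} .

% Squared length of \rho - 1/n, using \sum_i q_i^j = \sum_i p_i^j - 1 = 0.
\mathrm{Tr}\Bigl(\rho - \tfrac{\mathbb{1}}{n}\Bigr)^2
  = \sum_{j=1}^{n+1} \sum_{i,k=1}^{n} q_i^j q_k^j
      \Bigl(\delta_{ik} - \tfrac{1}{n}\Bigr)
  = \sum_{j,i} \bigl(q_i^j\bigr)^2
    - \tfrac{1}{n} \sum_{j} \Bigl(\sum_i q_i^j\Bigr)^2
  = \sum_{j,i} \bigl(q_i^j\bigr)^2 .
```

The cross terms between different bases vanish by the first line, which is why the squared length decomposes as a sum over bases, and the last expression is the sum of the quantities $I(p^j)$ when the normalisation of $I(p)$ is taken to be 1.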

Thus we see that if $I_{tot}$ differs from being a simple measure of mixedness,
then that is because it is a measure of length also; and this explains why it can
be given by a sum of quantities $I(p)$ for a complete set of mutually unbiased
measurements. As measures of how much we know given the state, however, $I_{tot}$
and $S(\rho)$ bear the same relation to their appropriate measures of information,
as we saw in the previous section. Equally, as measures of information, $H(p)$
stands to $S(\rho)$ in the same relation as $I(p)$ to $I_{tot}$. $I_{tot}$ is the upper bound on
the amount we can know about the outcome of a measurement as measured by
$I(p)$; $S(\rho)$ is the lower bound on our uncertainty about what the outcome of a
measurement will be, as measured by $H(p)$.

The complaint against the Shannon information was supposed to be that
as $H(p)$ fails to satisfy the total information constraint, it does not tell us
the information gained from a particular measurement; the complaint against
the von Neumann entropy that as $S(\rho)$ is not given by a sum of measures for a
complete set of mutually unbiased measurements, it is not suitably related to the
information gained from a measurement. However, we can now see that insisting
on the total information constraint in this way is tantamount to insisting that
only a function which measures the length of the component of $\rho$ lying in a
given hyperplane can be a measure of information, and correlatively, that the
only viable notion of total information content is a measure of the length of $\rho$
in $V_h(\mathbb{C}^n)$. But $H(p)$ can be a perfectly good measure of information without
having to be a measure of the length of the projection of $\rho$ into the subspace
associated with an observable; and as we have just seen, $S(\rho)$ does have an
explicit relation to the information gain from measurement that justifies its
interpretation as a total information content. A relation, moreover, that $I_{tot}$
also possesses and which serves to justify its interpretation as a measure of how
much we know given the state.

Hence our conclusion must be that the total information constraint is not a
reasonable requirement on measures of information. 

Of course, H(p) does not tell us the information gain on measurement if we
take, as Brukner and Zeilinger seem to, 'the information encoded in a basis'
simply to mean the length squared of the component of the state lying in the
measurement hyperplane; but this is a non-standard usage. H(p) certainly
remains a measure of our expected information gain from performing a particular
measurement (how much the outcome will surprise us, on average, given that
we have the probability distribution); and if we are interested in the amount
of information encoded, in the usual sense of the word, that can be decoded
using a particular measurement, i.e. if we have a string of systems into which
information has actually been encoded, then we may always just consider the
Shannon mutual information associated with that measurement. (The 'total
information' associated with this quantity will then be given, via the Holevo
bound, as the von Neumann entropy.)
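
The relation between the Shannon mutual information and the Holevo bound can
be illustrated with a small Python sketch. The encoding used here is an
assumption chosen purely for illustration and does not come from the text: a
classical bit is encoded, with equal prior probabilities, into the two
non-orthogonal pure qubit states |0> and |+>, and is decoded by a spin
measurement along z. Because the signal states are pure, the Holevo quantity
reduces to the von Neumann entropy of the average state.

```python
import math

def h2(p):
    """Binary Shannon entropy in bits."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

# Assumed encoding: bit 0 -> |0>, bit 1 -> |+>, each with prior 1/2.
# Bloch vectors (x, y, z) of the two pure signal states:
signals = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
priors = [0.5, 0.5]

def prob_up(bloch, axis):
    """Probability of the 'up' outcome for a spin measurement along `axis`."""
    return (1 + sum(r * a for r, a in zip(bloch, axis))) / 2

def mutual_information(axis):
    """Shannon mutual information H(Y) - H(Y|X), in bits, between the encoded
    bit X and the outcome Y of a spin measurement along `axis`."""
    cond = [prob_up(s, axis) for s in signals]
    p_up = sum(w * c for w, c in zip(priors, cond))
    return h2(p_up) - sum(w * h2(c) for w, c in zip(priors, cond))

# Holevo quantity: entropy of the average state minus the average entropy of
# the signal states; the latter vanishes because the signals are pure.
avg = [sum(w * s[k] for w, s in zip(priors, signals)) for k in range(3)]
r = math.sqrt(sum(x * x for x in avg))
holevo = h2((1 + r) / 2)

mi_z = mutual_information((0.0, 0.0, 1.0))
print("mutual information (z measurement):", mi_z)
print("Holevo bound (von Neumann entropy):", holevo)
```

For this particular encoding and measurement, the mutual information comes out
well below the von Neumann entropy of the average state, as the Holevo bound
requires.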

5 Conclusion 

Of the three arguments that Brukner and Zeilinger have presented against the
Shannon information, we have seen that the first two fail outright. These
arguments sought to establish that the notion of the Shannon information is
undermined in the quantum context due to a reliance on classical concepts. With
regard to the first we saw that, contrary to Brukner and Zeilinger, the existence
of a pre-determined string is neither necessary nor sufficient for the
interpretation of H(p) as a measure of information, hence the absence of such a
string would not cause any problems for the Shannon information in quantum
mechanics.

The objective of their second argument was to highlight classical assumptions
in the grouping axiom that would prevent the axiomatic derivation of the
Shannon information going through in the quantum case. This argument turned
out to be based on an erroneous reading of the grouping axiom that appeals
to joint experiments. The grouping axiom is in fact perfectly well defined in
the quantum case and the standard axiomatic derivation of the form of H(p)
can indeed go through. The grouping axiom does not reveal any problematic
classical assumptions implicit in the Shannon information.

In their final argument, Brukner and Zeilinger suggest that defining the 
notion of the total information content of a quantum system in terms of the 
Shannon information would lead to a quantity with the unnatural property of 
unitary non-invariance. But this is not a compelling argument against the Shan- 
non quantity as a measure of information. We have seen that it is not a necessary 
requirement on every meaningful measure of information that it sum to a uni- 
tarily invariant quantity for a complete set of mutually unbiased measurements; 
nor, conversely, is it necessary that every viable notion of total information 
content be given by such a sum of individual measures of information.

Brukner and Zeilinger's arguments thus fail to establish that the Shannon 
information involves any particularly classical assumptions or that there is any 
difficulty in the application of the Shannon measure to measurements on quan- 
tum systems. The Shannon information is perfectly well defined and appropriate 
as a measure of information in the quantum context as well as in the classical.

A Appendix: Zeilinger's Foundational Principle



As remarked in the introduction, Brukner and Zeilinger's discussion of the
Shannon information follows on from Zeilinger's (1999) proposal of an
information-theoretic foundational principle for quantum mechanics. This
principle, he suggests, is to play a role in quantum mechanics similar to that
of the Principle of Relativity in Special Relativity, or to the Principle of
Equivalence in General Relativity. Like these, the Foundational Principle is to
be an intuitively understandable principle which plays a key role in deriving
the structure of the theory. In particular, Zeilinger suggests that the
Foundational Principle provides an explanation for the irreducible randomness in
quantum measurement and for the phenomenon of entanglement. For discussion of
how the Foundational Principle motivates the arguments against the Shannon
information, see Timpson (2002); in this Appendix we shall consider whether the
Principle can indeed be successful as a foundational principle for quantum
mechanics.

Before stating the Foundational Principle, it is helpful to identify two
philosophical assumptions that Zeilinger's position incorporates. The first is a
form of phenomenalism: physical objects are taken not to exist in and of
themselves, but to be mere constructs relating sense impressions (Zeilinger
1999:633) 9 ; the second assumption is an explicit instrumentalism about the
quantum state:

The initial state... represents all our information as obtained by earlier
observation...[the time evolved] state is just a short-hand way
of representing the outcomes of all possible future observations.
(Zeilinger 1999:634)

With these assumptions noted, let us consider the two distinct formulations of 
the Principle presented in Zeilinger (1999): 

FP 1 An elementary system represents the truth value of one proposition. 
FP 2 An elementary system carries one bit of information.

At first glance, these two statements appear most naturally to be concerned 
with the amount of information that can be encoded into a physical system. 
However, this interpretation is at odds with the passage in which Zeilinger
motivates the Foundational Principle. In this passage, his concern is with the 
number of propositions required to describe a system. He considers the analysis 
of a composite system into constituent parts and remarks that it is natural 
to assume that each constituent system will require fewer propositions for its 
description than the composite does. The end point of the analysis will be 
reached when we have systems described by a single proposition only; and it is
these systems that are termed 'elementary'. 

The situation is clarified when Zeilinger goes on to explain what he means
by an elementary system carrying or representing some information:

...that a system "represents" the truth value of a proposition or
that it "carries" one bit of information only implies a statement
concerning what can be said about possible measurement results.
(Zeilinger 1999:635)

9 Here I take phenomenalism to be the doctrine that the subject matter of all
conceivable propositions is one's own actual or possible experiences, or the
actual and possible experiences of another.

Thus the Foundational Principle is not a constraint on how much information
can be encoded into a physical system. It is a constraint on how much the state
of an elementary system can say about the results of measurement. This
interpretation is rendered consistent with the discussion in terms of the
propositions required to describe a system, as from Zeilinger's instrumentalist
point of view, describing (the state of) a quantum system can only be to make a
claim about future possible measurement results. Furthermore, we can understand
the peculiar idiom of a system 'representing' some information, where this is
taken not to refer to the encoding of some information into a system, when we
recall that from the point of view of Zeilinger's phenomenalism, a physical
system is not an actual thing. On his view, a system represents a quantity of
information about measurement results because a physical system literally is
nothing more than an agglomeration of actual and possible sense impressions
arising from observations.

In short, however, it seems that a clearer, and perhaps more philosophically 
neutral, statement of the Foundational Principle would be the following: 

FP 3 The state of an elementary system specifies the answer to a single yes/no 
experimental question, 

where we have used the fact that by 'proposition' Zeilinger means something that
represents an experimental question. With this relatively clear statement of the
Foundational Principle in hand, let us now consider its claims as a foundational
principle for quantum mechanics.

To begin with, we should note the limitations implied by Zeilinger's
conception of the description of a system. It might not always be the case that
the state of an individual system can be characterised appropriately as a list of
experimental questions to which answers are specified; and in such a case, the
terms of the Foundational Principle cannot be set up. Consider the de
Broglie-Bohm theory, for example, with its elements of holism and contextuality:
even though the theory is deterministic, the results of measurements are in
general not determined by the properties of the object system alone but are the
result of interaction between object system and measuring device. It would seem
that this theory could neither be supported nor ruled out by the Foundational
Principle, as we can neither identify something that would count as an
elementary system in this theory, given the way 'elementary system' has been
defined, nor, a fortiori, begin to enumerate how many experimental questions
such an entity might specify. However, for present purposes, let us put this
sort of worry to one side.

Another concern arises when considering the distinction we have drawn
between describing a system and encoding information into it. Unlike encoding,
the notion of describing a system presupposes a certain language in which the
description is made, and the description of a given system could be longer or
shorter depending on the conceptual resources of the language used. If we are to
make a claim about the number of propositions required to describe a system,
as we must when identifying an elementary system to figure in the Foundational
Principle, then we must already have made a choice of the set of concepts
with which to describe the system. But this is worrying if the purpose of the
Foundational Principle is to serve as a basis from which the structure of our
theory is to be derived. If we already have to make substantial assumptions
about the correct terms in which the objects of the theory are to be described,
then it may be that the Foundational Principle will be debarred from serving
its foundational purpose. With this worry in mind, let us now consider the
first of the concrete claims for the Foundational Principle, that it explains the
irreducible randomness of quantum measurements.

Zeilinger's suggestion is that we have randomness in quantum mechanics
because: 

...an elementary system cannot carry enough information to provide
definite answers to all questions that could be asked experimentally
(Zeilinger 1999:636),

and this randomness must be irreducible, because if it were reduced to hidden
properties, then the system would carry more than one bit of information.
Unfortunately, this does not constitute an explanation of randomness, even if
we have granted the existence of elementary systems and adopted the
Foundational Principle. For the following question remains: why is it that
experimental questions exist whose outcome is not already determined by a
specification of the finest grained state description we can offer? How is it
that any space for randomness remains? Or again, why isn't one bit enough? The
point is, it has not been explained why the state of an elementary system cannot
specify an answer to all experimental questions: this does not in fact appear to
follow from the Foundational Principle. The Foundational Principle says nothing
about the structure of the set of experimental questions, yet this turns out to
be all-important.

Consider the case of a classical Ising model spin, which has only two possible
states, 'up' or 'down'; here one bit, the specification of an answer to a single
experimental question ('Is it up?') is enough to specify an answer to all questions
that could be asked. There is no space for randomness here, yet this classical
case is perfectly consistent with the Foundational Principle. Thus it seems that
no explanation of randomness is forthcoming from the Foundational Principle
and furthermore, it is far from clear that the Principle, on its own, in fact allows
us to distinguish between quantum and classical.

Of course, if one assumes that experimental questions are represented in
the quantum way, as projectors on a Hilbert space, then even for the simplest
non-trivial state space, there will be non-equivalent experimental questions, the
answer to one of which will not provide an answer to another; but we cannot
assume this structure if it is the very structure that we are trying to derive.
It appears from the way in which the Foundational Principle is supposed to be
functioning in the attempted explanation of randomness, that something like the
quantum structure of propositions is being assumed. But this is clearly fatal to
the prospects of the Foundational Principle as a foundational principle. 10

Does the Principle fare any better with the proposed explanation of
entanglement? The idea here is to consider N elementary systems, which, following
from the Foundational Principle, will have N bits of information associated
with them. The suggestion is that entanglement results when all N bits are
exhausted in specifying joint properties of the system, leaving none for
individual subsystems (Zeilinger 1999), or more generally, when more information
is used up in specifying joint properties than would be possible classically. The
underlying thought is that this approach captures the intuitive idea that when
we have an entangled system, we know more about the joint system (which may
be in a pure state) than we do about the individual sub-systems (which must be
mixed states). The proposal is further developed in (Brukner et al. 2001), where
Brukner and Zeilinger's information measure is used to provide a quantitative
condition for N qubits to be unentangled, which is then related to a condition
for the violation of a certain N-party Bell inequality.

To give a basic example of how the idea is supposed to work, consider the
case of two qubits. Recall that the maximally entangled Bell states are joint
eigenstates of the observables $\sigma_x \otimes \sigma_x$ and $\sigma_y \otimes \sigma_y$. From the Foundational
Principle, only two bits of information are associated with our two systems, i.e.
the states of these systems can specify the answer to two experimental questions
only. If the two questions whose answers are specified are 'Are both spins in the
same direction along x?' ($\frac{1}{2}(\mathbf{1} \otimes \mathbf{1} + \sigma_x \otimes \sigma_x)$) and 'Are both spins in the same
direction along y?' ($\frac{1}{2}(\mathbf{1} \otimes \mathbf{1} + \sigma_y \otimes \sigma_y)$), then we end up with a maximally
entangled state. If, by contrast, the two questions had been 'Are both spins in
the same direction along x?' and 'Is the spin of particle 1 up along x?', the
information would not have all been used up specifying joint properties and we
would have instead a product state (joint eigenstate of $\sigma_x \otimes \sigma_x$ and $\sigma_x \otimes \mathbf{1}$).
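
The eigenstate claims in this two-qubit example can be checked directly. The
following Python sketch is an illustration only: it uses plain nested lists
for the Pauli matrices in the z basis, picks one particular Bell state,
(|00> + |11>)/sqrt(2), and one particular product state, |+>|+>, and the
helper names are hypothetical. An eigenvalue of +1 corresponds to the answer
'yes' to the associated question and -1 to the answer 'no'.

```python
import math

def kron(a, b):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def apply(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def eigenvalue(m, v, tol=1e-12):
    """Return lam if m v = lam v for lam = +1 or -1, otherwise None."""
    mv = apply(m, v)
    for lam in (1, -1):
        if all(abs(x - lam * y) < tol for x, y in zip(mv, v)):
            return lam
    return None

I2 = [[1, 0], [0, 1]]
SX = [[0, 1], [1, 0]]
SY = [[0, -1j], [1j, 0]]

XX = kron(SX, SX)   # sigma_x (x) sigma_x
YY = kron(SY, SY)   # sigma_y (x) sigma_y
X1 = kron(SX, I2)   # sigma_x (x) 1

s = 1 / math.sqrt(2)
bell = [s, 0, 0, s]               # (|00> + |11>)/sqrt(2)
product = [0.5, 0.5, 0.5, 0.5]    # |+>|+>, both spins up along x

for name, state in (("Bell", bell), ("product", product)):
    print(name, "XX:", eigenvalue(XX, state),
          "YY:", eigenvalue(YY, state),
          "X1:", eigenvalue(X1, state))
```

In this sketch the Bell state has definite values for both joint observables
but none for the spin of particle 1 along x, whereas the product state has
definite values for $\sigma_x \otimes \sigma_x$ and $\sigma_x \otimes \mathbf{1}$ but not for $\sigma_y \otimes \sigma_y$.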

Now, although this idea may have its attractions when used as a criterion for 
entanglement within quantum mechanics, it does not succeed in providing an 
explanation for the phenomenon of entanglement, which was the original claim. 

If we return to the starting point and consider our N elementary systems,
all that the Foundational Principle tells us regarding these systems is that their
individual states specify the answer to a single yes/no question concerning each
system individually. There is, as yet, no suggestion of how this relates to joint
properties of the combined system. Some assumption needs to be made before
we can go further. For instance, we need to enquire whether there are supposed
to be experimental questions regarding the joint system which can be posed and
answered that are not equivalent to questions and answers for the systems taken
individually. (We know that this will be the case, given the structure of quantum
mechanics, but we are not allowed to assume this structure, if we are engaged
in a foundational project. 11 ) If this is the case then there can be a difference in
the information associated with correlations (i.e., regarding answers to questions
about joint properties) and the information regarding individual properties. But
then we need to ask: why is it that there exist sets of experimental questions
to which the assignment of truth values is not equivalent to an assignment of
truth values to experimental questions regarding individual systems?

10 In a sense, we could say that Zeilinger's explanation of randomness is problematic because
it fails to explain why the state space of quantum mechanics is so gratuitously large from
the point of view of storing information (Caves and Fuchs 1996). It is then striking that
this attempted information-theoretic foundational approach to quantum mechanics has not
allowed for one of the significant insights vouchsafed by quantum information theory.

Because such sets of questions exist, more information can be 'in the
correlations' than in individual properties. Stating that there is more information
in correlations than in individual properties is then to report that such sets of
non-equivalent questions exist, but it does not explain why they do so. However,
it is surely this that demands explanation: why is it not simply the case that
all truth value assignments to experimental questions are reducible to truth
value assignments to experimental questions regarding individual properties, as
they are in the classical case? That is, why does entanglement exist? In the
absence of an answer to the question when posed in this manner, the suggested
explanation following from the Foundational Principle seems dangerously close
to the vacuous claim that entanglement results when the quantum state of the
joint system is not a separable state.

Of course, if we are in the business of looking within quantum mechanics
and asking how product and entangled states differ, then it is indeed legitimate
to consider something like the condition Brukner et al. (2001) propose; and we
can then consider how good this condition is as a criterion for entanglement 12 .
But as mentioned before, if we are trying to explain the existence of
entanglement then we cannot simply assume the quantum mechanical structure of
experimental questions.

Let us close by considering a final striking passage. Zeilinger suggests that
the Foundational Principle might provide an answer to Wheeler's question 'Why
the quantum?' (Wheeler 1990) in a way congenial to the Bohrian intuition that
the structure of quantum theory is a consequence of limitations on what can be
said about the world:

The most fundamental viewpoint here is that the quantum is a consequence
of what can be said about the world. Since what can be said
has to be expressed in propositions and since the most elementary
statement is a single proposition, quantization follows if the most
elementary system represents just a single proposition. (Zeilinger
1999:642)

But this passage contains a crucial non-sequitur. Quantization only follows if
the propositions are projection operators on a complex Hilbert space. And why
is it that the world has to be described that way? That is the question that
would need to be answered in answering Wheeler's question; and it is a question
which, I suggest, the Foundational Principle goes no way towards answering.

11 To illustrate, a simultaneous truth value assignment for the experiments $\sigma_x \otimes \sigma_x$ and
$\sigma_y \otimes \sigma_y$ cannot be reduced to one for experiments of the form $\mathbf{1} \otimes \mathbf{a}\cdot\boldsymbol{\sigma}$, $\mathbf{b}\cdot\boldsymbol{\sigma} \otimes \mathbf{1}$.

12 At this point it is worth noting that there have been other discussions of entanglement
which develop the intuitive idea that when faced with entangled states, we know more about
joint properties than individual properties. A very general framework is presented by Nielsen
and Kempe (2001), who use the majorization relation to compare the spectra of the global
and reduced states of the system; a necessary condition for a state to be separable is then
that it be more disordered globally than locally.